Prediction of global marginal land resources for Pistacia chinensis Bunge by a machine learning method

Biofuel has attracted worldwide attention due to its potential to combat climate change and meet emission reduction targets. Pistacia chinensis Bunge (P. chinensis) is a prospective plant for producing biodiesel. Estimating the global potential marginal land resources for cultivating this species would be conducive to exploiting bioenergy yielded from it. In this study, we applied a machine learning method, boosted regression tree, to estimate the suitable marginal land for growing P. chinensis worldwide. The result indicated that most of the qualified marginal land is found in Southern Africa, the southern part of North America, the western part of South America, Southeast Asia, Southern Europe, and eastern and southwest coasts of Oceania, for a grand total of 1311.85 million hectares. Besides, we evaluated the relative importance of the environmental variables, revealing the major environmental factors that determine the suitability for growing P. chinensis, which include mean annual water vapor pressure, mean annual temperature, mean solar radiation, and annual cumulative precipitation. The potential global distribution of P. chinensis could provide a valuable basis to guide the formulation of P. chinensis-based biodiesel policies.


Methods
The study was carried out in three steps. First, we applied BRT models, combined with the presence records and a set of environmental predictors, to predict the environmental suitability for P. chinensis. Second, we identified the major environmental factors that determine the suitability for P. chinensis by evaluating the relative contribution rate of each environmental variable. Third, we overlaid the land-use map with the suitable regions obtained from the first step to extract the marginal land suitable for growing P. chinensis, and analyzed the land-use composition of the qualified marginal land resources.
We acquired the geographical distribution data of P. chinensis from the Global Biodiversity Information Facility (GBIF, http://www.gbif-uat.org/), which contains 1923 georeferenced records. These records were aggregated to grid cells with a common spatial resolution of 0.05°, yielding 670 presence records. The same number of points was selected as pseudo-absence records based on the broad environmental suitability of P. chinensis. Specifically, the pseudo-absence records were randomly selected within areas where the mean annual temperature is either lower than 5.8 °C or higher than 28.4 °C, and where the annual mean precipitation is less than 400 mm or more than 1900 mm. These areas have previously been shown to be unsuitable for the survival of P. chinensis 15.
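As a rough illustration of the two data-preparation rules described above, the following Python sketch snaps records to 0.05° grid cells and tests whether a location falls in the climatic range used for pseudo-absence sampling. The function names, the grid origin at (-180°, -90°), and the sample coordinates are assumptions made for this example; the study does not specify how the aggregation was implemented.

```python
import math
import random

CELL = 0.05  # grid resolution in degrees, as stated in the text


def cell_index(lon, lat, cell=CELL):
    """Return the (column, row) index of the grid cell containing a point.

    Assumes a grid anchored at lon = -180, lat = -90 (illustrative choice).
    """
    return (math.floor((lon + 180.0) / cell), math.floor((lat + 90.0) / cell))


def aggregate_to_cells(records, cell=CELL):
    """Collapse georeferenced records to one presence per occupied cell."""
    return sorted({cell_index(lon, lat, cell) for lon, lat in records})


def is_pseudo_absence_candidate(mean_annual_temp_c, annual_precip_mm):
    """Apply the climatic rule as worded in the text: temperature outside
    5.8 to 28.4 degrees C AND precipitation outside 400 to 1900 mm."""
    temp_unsuitable = mean_annual_temp_c < 5.8 or mean_annual_temp_c > 28.4
    precip_unsuitable = annual_precip_mm < 400 or annual_precip_mm > 1900
    return temp_unsuitable and precip_unsuitable


def sample_pseudo_absences(candidate_cells, n, seed=0):
    """Randomly draw n pseudo-absence cells from the candidate cells."""
    return random.Random(seed).sample(candidate_cells, n)


# Hypothetical records (lon, lat); the first two fall in the same cell.
records = [(116.401, 39.901), (116.449, 39.949), (116.46, 39.91)]
presence_cells = aggregate_to_cells(records)
```

In the study, this kind of aggregation reduced 1923 records to 670 presence cells, and 670 pseudo-absence cells were drawn to match.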
The set of environmental predictors consists of three types, climate, soil, and topography, which are closely associated with the distribution of P. chinensis. We included mean annual water vapor pressure, mean annual temperature, mean solar radiation, and annual cumulative precipitation to reflect the climatic conditions for P. chinensis. These data were derived from the WorldClim version 2.0 database (http://www.worldclim.com/). Soil water content, soil class, and soil depth were obtained from the World Soil Information (http://www.isric.org/) to represent the soil characteristics. The elevation and slope data from the website of the Consultative Group on International Agricultural Research (CGIAR) Consortium for Spatial Information (http://srtm.csi.cgiar.org) were used to represent the topographical conditions. In addition, land-use data obtained from NASA's Earth Observatory Group (https://lpdaac.usgs.gov/) were used to distinguish marginal lands. All of the above raster data were resampled to the same coordinate system with a spatial resolution of 0.05° using ArcGIS 10.2 software 21.
Boosted regression tree (BRT) models were applied to quantify the relationship between environmental predictors and occurrence records of P. chinensis and to predict its potential distribution. As an advanced machine learning technique, BRTs have been extensively used in ecological studies. BRTs use regression trees and gradient boosting to iteratively fit and combine multiple regression tree models to improve stability and predictive accuracy 22. The strength of BRTs lies in their ability to model complex ecological systems: they (1) readily accommodate continuous and categorical predictors, missing data, and outliers without requiring data pre-transformation, and (2) capture complex nonlinear relationships between the variables 23.
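To make the idea of iteratively fitting and combining regression trees concrete, the sketch below implements a deliberately simplified form of gradient boosting in Python: one predictor, single-split trees (stumps), and squared-error loss. This is not the 'gbm' implementation used in the study, which works with multiple predictors, deeper trees, and a loss suited to presence/absence data. The temperature values and presence labels are invented for illustration.

```python
def fit_stump(x, residuals):
    """Find the single split on x that minimises squared error of residuals."""
    values = sorted(set(x))
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2.0
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def stump_value(stump, xi):
    """Evaluate one stump at a single predictor value."""
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def fit_boosted_stumps(x, y, n_trees=200, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the residuals
    and add a shrunken copy of it to the running prediction."""
    baseline = sum(y) / len(y)
    fitted = [baseline] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - fi for yi, fi in zip(y, fitted)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        fitted = [fi + learning_rate * stump_value(stump, xi)
                  for xi, fi in zip(x, fitted)]
    return {"baseline": baseline, "stumps": stumps,
            "learning_rate": learning_rate}


def predict(model, xi):
    """Sum the baseline and all shrunken stump contributions."""
    total = sum(stump_value(s, xi) for s in model["stumps"])
    return model["baseline"] + model["learning_rate"] * total


def mse(model, x, y):
    """Mean squared error of the model on (x, y)."""
    return sum((yi - predict(model, xi)) ** 2 for xi, yi in zip(x, y)) / len(y)


# Hypothetical data: presence (1) only at intermediate temperatures.
temperature = [2.0, 4.0, 6.0, 10.0, 14.0, 18.0, 22.0, 26.0, 30.0, 32.0]
presence = [0, 0, 1, 1, 1, 1, 1, 1, 0, 0]
model = fit_boosted_stumps(temperature, presence)
```

Although each stump is a crude step function, their sum recovers a hump-shaped response to temperature, which is the kind of nonlinear relationship the text refers to.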
All BRTs were fitted in R 3.3.3 24, using the 'dismo' and 'gbm' packages 25,26. We ran 30 bootstrapped BRT models with a random sample of 1340 grid cells and averaged the results of the simulations to ensure reliability. In addition, we used the area under the curve (AUC) of the receiver operating characteristic (ROC) curve to evaluate the performance of the BRT models 27,28. We also applied ten-fold cross-validation, using the "gbm.step" function, to select the optimal number of regression trees for each model and to avoid overfitting. We used the relative contribution (RC) percentage to evaluate the relevance of each predictor in the BRT models and generated partial dependence plots to describe the effect of each predictor on the potential distribution of P. chinensis, with the other independent variables held at their mean or at a constant value. Each cross-validation BRT model built with training data (50%) yields a cross-validated AUC, while its predictions on testing data (50%) provide an estimate of a testing AUC.
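For readers unfamiliar with the AUC, the following Python sketch computes it directly from its pairwise definition (the probability that a randomly chosen presence receives a higher predicted suitability than a randomly chosen pseudo-absence, with ties counted as one half) and summarises a set of per-model AUC values as mean and standard deviation. The labels, scores, and AUC values are invented for illustration, and the helper names are not from the 'dismo' or 'gbm' packages.

```python
import statistics


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for presence, 0 for pseudo-absence.
    scores: predicted suitability for each record.
    """
    presences = [s for l, s in zip(labels, scores) if l == 1]
    absences = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in presences:
        for a in absences:
            if p > a:
                wins += 1.0
            elif p == a:
                wins += 0.5
    return wins / (len(presences) * len(absences))


def summarise(auc_values):
    """Mean and sample standard deviation across bootstrapped models."""
    return statistics.mean(auc_values), statistics.stdev(auc_values)


# Hypothetical AUC values from a handful of bootstrapped models.
example_aucs = [0.994, 0.996, 0.995, 0.997, 0.993]
mean_auc, sd_auc = summarise(example_aucs)
```

An AUC of 1.0 means every presence is ranked above every pseudo-absence, while 0.5 corresponds to random ranking.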

Results
Accuracy of global suitability prediction for P. chinensis. Overlaying the global occurrence points of P. chinensis on its environmental suitability map (Fig. 1), we find that the existing occurrences mostly fall in regions of high suitability, which indicates good performance of the BRT model.
To further evaluate the accuracy of our simulation, we calculated the AUC under the ten-fold cross-validation procedure, obtaining a training AUC of 0.995 ± 0.001 and a validation AUC of 0.995 ± 0.003, which lends credibility to the outputs of the BRT model.
In addition, we used the standard deviation values to quantify the uncertainty of the spatial prediction. The map in Fig. 2 shows relatively low uncertainty, which further supports the results of the model simulation.
www.nature.com/scientificreports/
Relative contribution of the environmental covariates. Given that the accuracy of our model prediction has been validated through multiple means, the relative contributions of the variables in the simulation can be considered reliable.

Potential marginal land resources suitable for P. chinensis.
Global distribution of potential marginal land resources for P. chinensis. A necessary step prior to estimating the worldwide potential marginal land available for proper cultivation of P. chinensis is evaluating the global suitability for growing this plant.
In the global map (Fig. 3), we distinguished the environmentally suitable regions for growing P. chinensis, which cover a broad span of latitudes from 45 degrees north to 45 degrees south. Generally, the map exhibits a descending trend in suitability from the coasts toward inland, with the exception of a few suitable regions in Africa and areas along borders between countries.
To identify potential land resources for planting, we determined whether each 5 × 5 km² grid cell is suitable for P. chinensis using a suitability threshold of 0.5. Then, to distinguish potential marginal land from all land resources suitable for P. chinensis, we included only a few land-use types, namely savannas, shrublands, and grasslands, as a compromise between environmental preservation and productivity. The resulting map is presented in Fig. 4.
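The thresholding and land-use overlay described above can be sketched as follows in Python. The treatment of a cell exactly at 0.5 as suitable, the use of land-use names rather than raster class codes, and the spherical-Earth formula for the area of a 0.05° cell are all assumptions of this example. The study used ArcGIS and does not state how cell areas were computed.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (spherical approximation)
THRESHOLD = 0.5              # suitability threshold stated in the text
MARGINAL_TYPES = {
    "closed shrublands", "open shrublands",
    "woody savannas", "savannas", "grasslands",
}


def cell_area_ha(lat_south, cell=0.05):
    """Area in hectares of a cell x cell degree box whose southern edge
    is at lat_south, on a sphere: R^2 * dlon * (sin(lat2) - sin(lat1))."""
    lat1 = math.radians(lat_south)
    lat2 = math.radians(lat_south + cell)
    area_km2 = (EARTH_RADIUS_KM ** 2 * math.radians(cell)
                * (math.sin(lat2) - math.sin(lat1)))
    return area_km2 * 100.0  # 1 km^2 = 100 ha


def marginal_area_ha(cells):
    """Sum the area of cells that are both suitable and of a marginal
    land-use type. Each cell is (lat_south, suitability, land_use)."""
    total = 0.0
    for lat_south, suitability, land_use in cells:
        if suitability >= THRESHOLD and land_use in MARGINAL_TYPES:
            total += cell_area_ha(lat_south)
    return total


# Hypothetical cells: only the first and last pass both filters.
cells = [
    (0.0, 0.8, "savannas"),
    (0.0, 0.3, "savannas"),      # below the suitability threshold
    (0.0, 0.9, "croplands"),     # suitable, but not a marginal land type
    (30.0, 0.6, "grasslands"),
]
total_ha = marginal_area_ha(cells)
```

Because a 0.05° cell shrinks toward the poles, summing true cell areas rather than counting cells avoids overstating land at higher latitudes.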
The quantified results show that globally, there are 1311.85 million hectares of marginal land in total for proper P. chinensis cultivation, mostly distributed in Southern Africa, the southern part of North America, the western part of South America, Southeast Asia, Southern Europe, and eastern and southwest coasts of Oceania.
Africa, although no P. chinensis has been recorded there yet, has the greatest amount of land resources for growing this plant, 528.19 million hectares in total. According to the map, the majority of Angola, Zambia, Mozambique, Tanzania, Zimbabwe, and Malawi, and a considerable proportion of Zaire, South Africa, Ethiopia, Madagascar, and Kenya have potential for future cultivation. South America has the second-largest amount of marginal land after Africa. Its total of 304.44 million hectares is mainly distributed in West Brazil, central Argentina, Uruguay, and Paraguay. We also find a relatively small portion of land resources in Peru, Bolivia, Chile, Ecuador, Venezuela, and Colombia. North America possesses 222.99 million hectares of plantable marginal land, mainly found in the United States and Mexico. Asia, to which P. chinensis is native, has only the third-smallest amount of marginal land for growing it. The majority of its 106.18 million hectares of land resources are in countries in Southeast Asia, including China, Myanmar, and Vietnam. Parts of Turkey, India, Iran, Iraq, and the border region between Pakistan and Afghanistan are also found suitable for potential cultivation. In Oceania, marginal land resources are mostly distributed in coastal regions. The total amount of land resources is 101.04 million hectares, of which 97.99% are found in Australia. Europe has the least land resources among all continents, 49.01 million hectares, mainly distributed in Spain, Portugal, Ireland, the United Kingdom, Greece, Italy, and France.
Global land-use composition of potential land resources for P. chinensis. Classification of the global marginal land resources by land-use type could reveal the land-use structure of available land for P. chinensis all over the globe, indicating that the land resources mostly consist of woody savannas, savannas, grasslands, and open shrublands, leaving only a tiny fraction of marginal land as closed shrublands.
Our calculation shows that woody savannas take up the largest proportion of land resources, accounting for 37.72% (494.80 million hectares) of the total. In addition, savannas (358.71 million hectares), grasslands (257.10 million hectares), and open shrublands (198.65 million hectares) account for substantial proportions of the plantable land resources, 27.34%, 19.60%, and 15.14%, respectively. In comparison, closed shrublands (2.59 million hectares) make an almost negligible contribution to the marginal land for planting, at 0.20% of the total.
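The reported shares can be reproduced from the areas given in the text. The short Python check below recomputes the land-use percentages and confirms that the land-use breakdown and the continental breakdown both add up to the same global total of 1311.85 million hectares. All figures are taken from the text, and only the variable names are introduced here.

```python
# Areas in million hectares, as reported in the text.
land_use_mha = {
    "woody savannas": 494.80,
    "savannas": 358.71,
    "grasslands": 257.10,
    "open shrublands": 198.65,
    "closed shrublands": 2.59,
}
continent_mha = {
    "Africa": 528.19,
    "South America": 304.44,
    "North America": 222.99,
    "Asia": 106.18,
    "Oceania": 101.04,
    "Europe": 49.01,
}

land_use_total = sum(land_use_mha.values())
continent_total = sum(continent_mha.values())

# Percentage share of each land-use type, rounded to two decimals.
land_use_share_pct = {
    name: round(area / land_use_total * 100.0, 2)
    for name, area in land_use_mha.items()
}
```

Both totals come to 1311.85 million hectares, and the rounded shares match the percentages quoted above.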
Moreover, we calculated the percentage of each land-use type suitable for planting P. chinensis in each continent, presented in the pie charts (Fig. 5). Woody savannas hold the leading position in the marginal land resources of several continents, including Africa, Asia, Europe, and Oceania. The next most common type is savannas, which make up the largest proportion of marginal land in South America.
Furthermore, the quantified results for countries with more than 10 million hectares of land resources for P. chinensis cultivation are listed in descending order of area in Table 2. The table shows that 9 of the 21 countries in the list are in Africa, amounting to 443.95 million hectares and accounting for 33.84% of the total land resources. However, the country with the largest amount of available marginal land is Brazil (190.58 million hectares) in South America, where the land consists mainly of savannas. In terms of the dominant land-use type in these countries, 7 of the 21 countries are dominated by woody savannas (236.83 million hectares in total), 6 by grasslands (156.90 million hectares), 5 by savannas (235.73 million hectares), and 3 by open shrublands (87.58 million hectares). The results on the national scale are generally consistent with those on the continental scale, indicating the substantial share of African countries and of woody savannas in the total available marginal land.

Discussion
In this study, an advanced machine learning algorithm, boosted regression trees, is used to obtain the potential spatial distribution of P. chinensis, taking into account crop growing conditions and marginal land types. Although the occurrence records for P. chinensis are mainly located in China, the United States, and Australia, the potential marginal land suitable for P. chinensis is abundant in many other countries, amounting to a total of 1311.85 million hectares.
The potential spatial distribution of P. chinensis obtained in this study refers to the macroscopic distribution from the perspective of environmental suitability. Combining the total cultivatable marginal land estimated in this study with the fruit oil content, yield per hectare, and biodiesel conversion rate reported in previous literature, we roughly calculated the global biodiesel that could be produced from P. chinensis, which ranges from 375.54 to 667.31 billion liters assuming that 30% of the marginal land resources can be used for P. chinensis cultivation 7,29-31. Given the greenhouse gas (GHG) abatement offered per liter of biodiesel used in lieu of gasoline or diesel, using P. chinensis-based biodiesel could reduce emissions by 0.27 to 1.95 gigatons of CO2e 32, meeting 0.98% to 6.96% of the annual reduced emissions goal (28 GtCO2e) to limit global warming to 1.5 °C. A more optimistic scenario uses 50% of the available marginal land to grow P. chinensis; the resulting 625.90 to 1112.19 billion liters of P. chinensis-based biodiesel, used in place of gasoline or diesel, could reduce emissions by 0.46 to 3.25 gigatons of CO2e, meeting 1.63% to 11.60% of the annual reduced emissions goal.
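The biodiesel estimates above scale linearly with the share of marginal land planted, which the following Python sketch makes explicit. It back-calculates the biodiesel yield per hectare implied by the reported totals and expresses an abatement figure as a share of the 28 GtCO2e annual goal. The per-hectare yields are derived here from the reported figures rather than quoted from the cited literature. Percentages recomputed from the rounded gigaton values may differ slightly from the published ones, which were presumably calculated from unrounded values.

```python
TOTAL_MARGINAL_MHA = 1311.85  # million hectares, from this study
ANNUAL_GOAL_GT = 28.0         # annual emissions reduction goal, GtCO2e


def implied_yield_l_per_ha(biodiesel_billion_l, land_fraction):
    """Biodiesel yield (liters per hectare) implied by a reported total,
    given the fraction of marginal land assumed to be planted."""
    planted_ha = TOTAL_MARGINAL_MHA * 1e6 * land_fraction
    return biodiesel_billion_l * 1e9 / planted_ha


def share_of_goal_pct(abatement_gt):
    """Abatement as a percentage of the annual reduction goal."""
    return abatement_gt / ANNUAL_GOAL_GT * 100.0


# Implied yields under the 30% and 50% land-use scenarios.
low_30 = implied_yield_l_per_ha(375.54, 0.30)
high_30 = implied_yield_l_per_ha(667.31, 0.30)
low_50 = implied_yield_l_per_ha(625.90, 0.50)
high_50 = implied_yield_l_per_ha(1112.19, 0.50)

upper_share_30 = share_of_goal_pct(1.95)
```

Both scenarios imply the same per-hectare range, roughly 950 to 1700 liters of biodiesel per hectare, so the 50% scenario is simply the 30% scenario scaled by 5/3.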
However, when focusing on a particular area, specific local considerations should be taken into account, including ecological protection and economic and environmental benefits, to ensure the sustainable development of bioenergy in each region 33-35. For example, although the results of this study suggest that Africa has by far the most marginal land resources available for energy development, it is not suitable for developing P. chinensis-based biodiesel on a large scale because its current condition fails to meet the criteria of modern bioenergy technologies 36. Furthermore, the development of P. chinensis should also balance food, fodder, and fuel supply in land-use planning in Africa 37. In addition, to account for economic and environmental benefits, the life cycle assessment of P. chinensis-based biodiesel should be considered when determining which marginal land to develop 38,39.
The limited occurrence records are the major limitation of this study. The global potential spatial distribution of marginal land suitable for P. chinensis would be more precise if more samples were recorded and available for download.

Conclusion
P. chinensis is one of the prospective energy plants, and estimating the potential marginal land resources for its cultivation would be conducive to exploiting the bioenergy yielded from this species. Boosted regression tree (BRT), the machine learning method we applied in this study, enabled us to evaluate the relative importance of the environmental variables. This revealed that the suitability for growing P. chinensis is primarily determined by several indicators of climate conditions, including mean annual water vapor pressure, mean annual temperature, mean solar radiation, and annual cumulative precipitation.
Moreover, aided by this model, we mapped the suitability for growing P. chinensis across the planet, from which we derived the marginal land resources distributed in each continent. Most of the qualified marginal land is found in Southern Africa, the southern part of North America, the western part of South America, Southeast Asia, Southern Europe, and eastern and southwest coasts of Oceania, for a grand total of 1311.85 million hectares.